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Abstract 



PReMiuM is a recently developed R package for Bayesian clustering using a Dirich- 
let process mixture model. This model is an alternative to regression models, non- 
parametrically linking a response vector to covariatc data through cluster membership 
(Molitor et al. 2010). The package allows Bernoulli, Binomial, Poisson, Normal and cat- 
egorical response, as well as Normal and discrete covariates. Additionally, predictions 
may be made for the response, and missing values for the covariates are handled. Several 
samplers and label switching moves are implemented along with diagnostic tools to assess 
convergence. A number of R functions for post-processing of the output are also provided. 
In addition to fitting mixtures, it may additionally be of interest to determine which co- 
variates actively drive the mixture components. This is implemented in the package as 
variable selection. 

Keywords: Profile regression, Clustering, Dirichlet process mixture model. 



Profile regression is an alternative to regression models for when one tries to make inference 
beyond main effects for datasets with potentially correlated covariates. In particular, profile 
regression non-parametrically links a response vector to covariate data through cluster mem- 
bership (Molitor et al. 2010). We have implemented this method in the R (R Core Team 
2012) package PReMiuM. 

PReMiuM performs Bayesian clustering using a Dirichlet process mixture model and it allows 
Bernoulli, Binomial, Poisson, Normal and categorical response, as well as Normal and discrete 
covariates. Moreover, predictions may be made for the response, and missing values for the 
covariates are handled. Several samplers and label switching moves are implemented along 
with diagnostic tools to assess convergence. A number of R functions for post-processing of 
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the output are also provided. In addition to fitting mixtures, it may additionally be of interest 
to determine which covariates actively drive the mixture components. This is implemented 
in the package as variable selection. 

In order to demonstrate the PReMiuM package, it is helpful to present an overview of the 
Dirichlet Process. We begin this section by re-familiarising the reader with such a process, 
introducing notation that we shall call upon throughout the paper. Formally, let (f2, J-, P) 
be a probability space comprising a state space O with associated u-field J- and a probability 
measure Po- We say that a probability measure P follows a Dirichlet process with concentra- 
tion parameter a and base distribution Pq parametrised by 60, written P ~ DP(a,Pe ) if 

(P(A 1 ),P(A 2 ), . . . ,P(A r )) ~ Dirichlet(aP eo (^i), 0^60(^2), • • • , aPe (A)) (1) 
for all Ai,A 2 , ... ,A r E T such that Ai D Aj = for all % / j and [fj=i A j = 

1.1. The stick-breaking construction 

Although Definition (1) is perhaps rather abstract, proof of the existence of such a process 
has been determined in a variety of ways, using a number of different formulations (Ferguson, 
1973 and Blackwell and MacQueen, 1973). In this paper we focus on Dirichlet process mixture 
models (DPMM), based upon the following simplified constructive definition of the Dirichlet 
process, due to Sethuraman (1994). If 

00 

P = $>c<*e c , 

c=l 

6 C ~ P Qo for c 6 Z+, 

^ c = VcHil-Vi) forcGZ+\{l}, (2) 

1<C 

tpi = Vi, and 

V c ~ Beta(l,a) for c G Z + , 

where S x denotes the Dirac delta function concentrated at x, then P ~ DP(a,Peo)- This 
formulation for V and if) is known as a stick-breaking distribution. Importantly, the distri- 
bution P is discrete, because draws ©1, 02, • • . from P can only take the values in the set 
{0 C : c G Z+}. 

As noted by many authors (for example Ishwaran and James, 2001 and Kalli et al, 2011) it 
is possible to extend the above formulation to more general stick-breaking formulations, for 
example allowing V c ~ Beta(a c , b c ) independently, resulting in a generalised Dirichlet process, 
such as the two parameter Poisson-Dirichlet process (Pitman and Yor 1997). Another example 
is provided by Chung and Dunson (2009) and Rodriguez and Dunson (2011) where the Beta 
distibutions are replaced by a Probit model, which may be allowed to depend on observed 
data, for the purpose of capturing dependence such as a spatial structure. The methods and 
results that we propose within this paper hold for such generalised processes, but at present 
the package is only coded to implement the Dirichlet process in equation (2). 

Typically, because of the complexity of the models based on the stick-breaking construction, 
inference is made in a Bayesian framework using Markov chain Monte Carlo (MCMC) meth- 
ods. Until recently, a perceived difficulty in making inference about this model was the infinite 
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number of parameters within the stick breaking construction. Historically, this obstacle has 
resulted in the use of algorithms that either explore marginal spaces where some parame- 
ters are integrated out or use truncated approximations to the full Dirichlet process mixture 
model, see for example Neal (2000) and Ishwaran and James (2001). 

More recently, two alternative innovative approaches to sampling the full DPMM have been 
proposed. The first, introduced by Walker (2007), uses a novel slice sampling approach, re- 
sulting in full conditionals that may be explored by the use of a Gibbs sampler. The difficulty 
of the proposed approach is the introduction of contraints that complicate the updates of 
the mixture component weights, leading to potential mixing issues. To overcome this, Kalli 
et al. (2011) generalise this sampler, adding further auxilliary variables, and report good 
convergence results, although the authors note that the algorithm is sensitive to these addi- 
tional parameters. The second distinct MCMC sampling approach was proposed in parallel 
by Papaspiliopoulos and Roberts (2008). The proposed sampler again uses a Gibbs sam- 
pling approach, but is based upon an idea termed retrospective sampling, allowing a dynamic 
approach to the determination of the number of components (and their parameters) that 
adapts as the sampler progresses. The cost of this approach is an ingeneous but complex 
Metropolis- within- Gibbs step, to determine cluster membership. 

Despite the apparent differences between the two strategies, Papaspiliopoulos (2008) noted 
that the two algorithms can be effectively combined to yield an algorithm that improves 
either of the originals. The resulting sampler was implemented and presented by Yau et al. 
(2011), and a similar version was presented by Dunson (2009) for DPMM. The current sampler 
presented in this paper is our interpretation of these ideas, implemented as an R package. This 
package, called PReMiuM, is based upon efficient underlying C++ code for general DPMM 
sampling and it is available on CRAN. 

In Section 2 we describe formally the Dirichlet process mixture model implemented in PRe- 
MiuM and the blocked MCMC sampler used. In Section 3 we discuss profile regression and 
its link with the response and covariate models included in the package. In Section 4 we give 
an overview of the postprocessing tools available in PReMiuM to learn from the rich output 
produced by our Bayesian model and in Section 5 we discuss diagnostic tools that we propose 
to investigate the convergence of the MCMC. Finally, in Section 6 we give a brief overview of 
the structure of the code and show examples of its use as well as an indication of run times 
in Section 7. 

2. Sampling the Dirichlet process mixture model 
2.1. Definition and properties 

Perhaps the most common application of the Dirichlet process is in clustering data, where 
it can be used as the prior distribution for the parameters of an infinite mixture model. 
Consider again the stick breaking construction in Equation 2. For the Dirichlet process 
mixture model (DPMM), the (possibly multivariate) observed data D = (D\, D2, ■ ■ ■ , D n ) 
follow an infinite mixture distribution, where component c of the mixture is a parametric 
density of the form / c (-) = /(-|O c , A) parametrised by some component specific parameter C 
and some global parameter A. Defining (latent) parameters Gi, 02, ■ • • , @ n as draws from a 
probability distribution P following a Dirichlet process DP(a, Pq ) and again denoting the 
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dirac delta function by 5, this system can be written, 

Di\e h A ~ /(A|e,,A) fori = 1,2,..., n, 



. n, 



@« ~ ^e e for % = 1,2, . . . ,! . 

0=1 

e c ~ P© forcGZ + , 

^ c = ^Ila-^) forcGZ+\{l}, 

■01 = V\, and 

V c ~ Beta(l.a) for c G Z+. 



(3) 



When making inference using mixture models (either finite or infinite) it is common practice 
to introduce a vector of latent allocation variables Z. Such variables enable us to explicity 
characterise the clustering and additionally facilitate the design of MCMC samplers. Adopting 
this approach and writing ip = {ipi,tp2, ■ ■ ■) and = (0i, 02, • • •), we re-write Equation 3 as 

Di\Z,&,A ~ /(A|6^,A) fori = l,2,...,ra, 

G c ~ P @0 for c£Z + , 

P(Zi = c\i/>) = ip c force Z+, i = 1,2,..., n, 

0c = VcHil-VO forceZ+\{l}, 

1<C 

01 = Vi, and 
V c ~ Beta(l,a) for c G Z + . 



(4) 



The likelihood of Di associated with the DPMM is simply the first line of Equation 4. Inte- 
grating out the latent variable Zi we obtain the more recognisable mixture likelihood 

oo 

p(A|^,®,A) = J2p(D i \Z i = c,@,A)p{Z i = c\iP) 

c=l 
oo 

= ^Vc/(A|e CJ A). 



c=l 



The remainder of Equation 4 provides the prior specification for the DPMM, allowing us to 
write the joint posterior distribution as 

p(Z,®,A,V,a\D) oc p(U|Z,e,A)p(Z ) V,a,e,A|e ) 

n oo 

oc ]J{f(D i \@ z .,A)p(Z i \V)}l[{ P (V c \a)p(e c \@ )} P (A)p(a) 



i=l 



c=l 



« nh(A|ez i; A) 



i=l 



v Zi n (i - 



(5) 



[J{(l-F c )«-ye c |G )}p(A)p(a). 



c=l 
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Of course, additional layers of hierarchy could easily be introduced, for example through 
hyper- priors for ©o- 

2.2. The MCMC sampler 

A common approach to MCMC sampling from the DPMM is to integrate out V and use a 
Gibbs sampler on the resulting space. Such samplers are commonly referred to as Poly a urn 
samplers, since they are motivated by the Polya urn representation of a Dirichlet process in- 
troduced by Blackwell and MacQueen (1973). Ishwaran and James (2001) provide a review of 
this approach and demonstrate that conditionals of the type p(Zi\Z-i) can be derived, where 
Z-i = (Zi, . . . , Zi-i, Zi+i, Z n ). Many other authors have focused on developing alternative 
samplers of this nature, including Neal (2000) and Green (2010). However, as noted by many 
authors, samplers of this nature, where the allocation of a single observation is conditional on 
the allocations of all other observations, can often suffer from poor mixing. This motivates 
the need for an alternative class of samplers that sample from the full model 4. 

In the full model, the posterior conditionals for the allocation variables Z depend upon an 
infinite number of variables V and 0. One way to bypass this complication is to truncate the 
definition in Equation 4 to mixtures with C components (of which potentially only a subset 
will be non-empty). Ishwaran and James (2001) demonstrate that under this model the 
conditional distributions are standard distributions which can be easily sampled from. This 
is one of the three samplers implemented in our R package, and we refer to it as truncated. 
Although this approach resolves the challenges of sampling from the full model, if C is not 
chosen to be sufficiently large, then the posterior may be quite different on the truncated 
model space compared to the full model space. The authors provide some guidance for choice 
of C, but more recent work by Walker (2007) and Papaspiliopoulos and Roberts (2008) 
demonstrate techniques that alleviate the need for such a truncation, whilst retaining many 
of the sampling properties of the full conditionals. 

The first step is to borrow the idea of Walker (2007) and introduce auxiliary variables U = 
(Ui, U2, • • • , U n ) such that the joint posterior can be re-written as 

n 

p(U,Z,®,A,V,a\D) cx J] {/(A|6 Zi , A)l {Di< ^ } } (6) 

i=l 

00 

x II K 1 - V c r- 1 p(Q c \e )}p(A)p(a) 

c=l 

n 
i=l 

oo 

x II {C 1 - ^) a -Ve c |0o)}p(AMa). 

c=l 

Here, 1a, is the function that takes the value 1 over the set A and elsewhere. By combining 
this auxiliary variable approach with the notion of retrospective sampling (i.e. adopting a 
just-in-time approach to sampling empty mixture component parameters, as introduced in 
Papaspiliopoulos and Roberts, 2008), it is possible to construct an efficient Gibbs sampler 
for sampling from the joint posterior 7. Integrating U out of Equation 7 with respect to the 
Lebesgue measure yields the DPMM posterior distribution given in Equation 5 meaning that 
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marginalising samples of the joint distribution over U results in samples from the desired 
distribution. 

Kalli et al. (2011) extend the idea of Walker (2007) to a general class of slice samplers by 
writing 

n 

p(U,Z,®,A,V,a\D) oc {f(Di\e Zi , A)l {Ui< ^ } ^} (8) 

i=l 

oo 

x n K 1 - y c) a ~ V© c |e )}p(AM«) 

c=l 

where £i, £2? ■ • • is any positive sequence. 

When £i = ipi, this corresponds to the efficient Gibbs sampler proposed by Papaspiliopoulos 
(2008) in the context of DPMM using parameter blocking. This is the second of the three 
samplers implemented in our R package, and we refer to it as slice dependent, in accordance 
with Kalli et al. (2011). Most importantly, this design permits the introduction of label 
switching moves, without which it is very difficult to obtain sufficient mixing. We discuss this 
in detail in Section 5. 

For the last of the three samplers implemented in our R package, that we refer to as slice 
independent in accordance with Kalli et al. (2011), we set £ = (k — 1)k j_1 with k = 0.8 as 
proposed by Kalli et al. (2011). 

We continue by defining some new notation which is required to present our slice samplers. 
First, given the allocation variables Z, define 

Z* = max Z{. 

\<%<n 

Similarly, given the auxiliary variables U and the vector V, define 

U* = min Ui. 

l<i<n 

and 

C* = minj CG Z+ : J>>l-£/*l (9) 



1=1 



min \c£ Z + : 



1=1 



Vil[(l-V r ) 

r<l 



>1-U* 



It is important to emphasise that these values potentially change at each sweep of the sampler, 
as the underlying variables change, although for simplicity of exposition we have omitted 
explicitly labelling the parameters with the sweep. The purpose of the variable C* is to provide 
an upper limit on which mixture components need updating at each sweep. Specifically, 
although there are infinitely many component parameters in the model, since P(Zi = c\Ui > 
ip c ) = 0, we need only concentrate our updating efforts on those components c for which 
ip c > U{ for some i = 1,2, ... ,n. By defining C* as in equation 9 it can be shown (see 
Appendix A.l) that ijj c < Ui for all c > C* and all i = 1, 2, . . . , n. Assuming that the Markov 
chain is initialised accordingly and is updated using the correct conditionals, it is possible to 
show Z* < C* < oo almost surely (details provided in Appendix A.l). 
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With these definitions in place we make use of the following sets and vectors (which again 
will change at each sweep) 

A = {c G Z + : c < Z*}, P = {ceZ + : Z* <c< C*}, I = {c G Z+ : c > C*} 

v A = (v 1 ,v 2 ,...,v z *), & A = (ei,e 2 ,...,e z *) 
v p = (v z *+i, v z *+2, ...,vc*), ® p = (e z *+i, &z*+2, . . . , e c *) 
v 1 = (Vc*+i, v c *+2, ■■■), & 1 = (e c *+i, ec*+2, . . .)■ 

Here the A, P and / are disjoint sets (updated at every sweep of the MCMC algorithm) 
that partition Z + , with names chosen to denote Active, Potential and Inactive components 
respectively. It is possible that P = 0. By definition, all observations are allocated to a 
mixture component labelled by one of the indices in A. Components labelled with an index 
in P or I are necessarily empty (i.e. have no observations allocated to them), the difference 
being that at the next update of the allocation variables, components with labels in P may 
potentially become non-empty. 

The blocked infinite DPMM algorithm can now be defined using the following blocked Gibbs 
updates to sample from the relevant conditionals. 

DPMM algorithm 

Suppose we are at sweep t of the sampler. Update as follows: 
Step A. Compute Z* and the set A. 

Step B. Sample (V A +1 , Z, U t+1 ) ~ p(V A , & A , Z, U\ Vf , V(, ©f , @(, a t , A t , 9 , D). 

B.l V A ~p(V A \Z t ,a t ) 

B.2 @ A ~p{® A \Z t ,A t ,Q ,D) 

B.3 (V A +1 , @ A +1 , Z) ~ p(V A , @ A , Z\V?, 0f , ®i, a t , At, @o, D) 

B.4 U t+ i ~ P (U\V A +1 ,Z) 

Step C. Compute U*. Recompute Z* and the set A. 

Step D. Sample (a t+1 , Vf +1 , V( +1 ) ~ p(a, V p , V f \V A +1 , & A +1 , &(, U t+1 , Z, A t , 6 , D), com- 
puting C* and the set P in the process. 

B.l a t+1 ~ p(a\Vf +1 ) 

D. 2 Vf +1 ~p{V p \a t+1 ) 

Step E. Sample (0f +1 , © t 7 +1 ) ~ p(® p , J | V A +1 , Vf +1 , V( +1 , & A +1 , Z, a t+1 , A t , O , D). 

E. l 0f +1 ~p(0 p |e o ) 

Step F. Sample A t+1 ~ p(A\V A +1 , V P +1 ,V( +1 ,® A +1 ,& P +1 , & T t+1 , Z,a t+1 ,Q ,D). 

F. l A~p(A\® A +1 ,Z,D) 

Step G. Sample Z t +i ~ p(Z\V A +1 , Vf +1 , V{ +1 , ®t+i, Ut+i,oct+i,A t+ i,&o, D). 
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G.l Z t+1 ~p(Z\V? +li Vf +1 ,®t +l , &f +1 ,U t+l , A t+1 ,D) 

While this algorithm is somewhat generic, the blocking strategy is clearly highlighted. Fur- 
ther details explaining each of the steps are provided in Appendix A. 2. The key idea is 
that by doing joint updates, we can marginalise out an infinite number of variables when 
necessary, to ensure that we are always sampling from conditional distributions that depend 
only upon a finite number of parameters. In particular, after marginalisation, the parameters 
corresponding to the inactive set / do not contribute to the conditional distributions of the 
other parameters, so we do not actually need to sample their values. Since these parameters 
have no contribution to the likelihood, if values are subsequently required they can simply 
be sampled from the prior retrospectively as necessary. Although the sampler is written as a 
blocked Gibbs sampler, where it is not possible to sample directly from full conditionals (for 
example in the update of 0, depending upon the choices of / and P@ ) Metropolis- within- 
Gibbs steps are applied. Depending on the application, the Gibbs updates specified above 
may comprise several different Gibbs or Metropolis-within-Gibbs steps (for example updating 
and A. Typically, where Metropolis-Hastings updates are required we advocate adopting 
an adaptive Metropolis-Hastings approach: see Andrieu and Thorns (2008) for a review. 



3. Example Models 

The general sampler of the previous section is applicable for many specific models, depending 
on the choices of / and P@ . In this section we provide further details for some of the models 
that are implemented within our software. We detail the prior choices that are made within 
our implementation, but, of course, alternative priors could be chosen. 

3.1. Gaussian mixtures 

Perhaps the most common model to be implemented under the DPMM framework is the 
Gaussian mixture model, where D = X for some covariate data X, and X assumes a mixture 
of Gaussian distributions. Under this setting for each cluster c, the cluster specific parameters 
are given by G c = (/u c , S c ), where [i c is a mean vector and T, c is a covariance matrix. There 
are no additional global parameters A. Under this setting 

p{X^ S Zi , A) = f(X^ Zi , S Z J = (2<K)-i |E Zi |-i exp j-^X, - /^fS^pQ - » Zi )\ . 

(10) 

By choosing /i c ~ Normal(/io, ^o) and S c ~ InvWishart(f?o, «o) (for each c) for our prior 
model Pq we have a conjugate model, permitting Gibbs updates for the parameters fj, A 
and Y1 A associated the active components, and also those (n P and X! p ) associated with the 
potential components. The choice of values for the hyperparameters Go = (no, Go, Ro, «o) is 
discussed further in section 6. 

3.2. Discrete mixtures 

Clearly the DPMM model applies to mixtures other than simply the Gaussian case. Consider 
for example the case where for each individual i, Di = Xi is a vector of J independent discrete 
categorical random variables, where the number of categories for covariate j = 1, 2, . . . , J is 
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Kj. Then we can write 6 C = $ c ,= (</>c,i,i, 4>c,i,2, ■ • -,^,1,^1,^,2,1, • • ■ , 4>c,2,Ki, <f>c,J,Ji, ■ ■ -,(f>c,J,Kj) 
and 

J 

p(A|^,e Zi> A) = /(A|$zJ = n^i^w ( n ) 

i=l 

Again, there are no global parameters A. 

Letting Qq = a = (ai, 02, ... , aj), where for j = 1, 2, . . . , J, atj = (a^i, 0^2, • • • , o-j,Kj) and 
adopting conjugate Dirichlet priors <3? CJ = (0 C j,i, 0cj,2, • ■ • , <f>c,j,Kj) ~ Dirichlet(oj), each <1> C 
can be updated directly using Gibbs updates. For full details of the posterior conditional 
distribution see Molitor et al. (2010). 

3.3. Mixed mixtures 

An alternative model is given by a mixture of some continuous and discrete random variables. 
Following the notation used above for Gaussian and discrete mixtures, for J\ continuous 
random variables and J2 discrete random variables, 

p(D i \Z i ,@ z .,A)=p(D}\v Zv Z Zi )p(tf\$z i ) (12) 

where Dj is the subset of the continuous random variables in A and Df is the subset of 
the categorical random variables in A- Note that we are assuming independence between 
continuous and categorical data conditional on the cluster allocations. 

3.4. Profile regression 

Recently, interest has grown in using DPMM as an alternative to regression models, non- 
parametrically linking a response vector Y to covariate data X through cluster membership. 
This idea has been pioneered by several authors including Dunson et al. (2008), Bigelow and 
Dunson (2009), Molitor et al. (2010), Papathomas et al. (2011), and Molitor et al. (2011). 
Our presentation is most similar to the latter three of these articles which refer to this idea 
as "profile regression". 

For the case of either Gaussian or Discrete mixtures, as described above, our implementation 
permits the joint modelling of a response vector, for various response models which we present 
below. Formally, the data D = (Y, X, W) is now extended to contain response data Y{ and 
covariate data A, for each individual i, where the contribution of the covariate data to the 
response may be cluster dependent. There is also the possibility to include additional fixed 
effects Wi for each individual, which are constrained to only have a global (i.e. non-cluster 
specific) effect on the response Yi. 

The data A is then jointly modelled as the product of a response model and a covariate 
model, to give the following likelihood: 

p( A I Z u ®, A) = fy {Yi I @ Zi , A, Wjfx (Xi I @ Zi , A) . 

The covariate likelihood fx is of either of the forms presented in Sections 3.1 or 3.2. The 
likelihood fy depends upon the choice of response model. 

Binary response 

Adopting a binary response model, each parameter vector G c is extended to include an 
additional parameter 9 C . We also introduce the global parameter vector A = /3, of the same 
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length L as the fixed effects vector Wi, to capture the contribution of these effects. Then, 
f Y (Yi\@ Zi ,A, Wi) =p(Yi = l\e Zi ,P,Wi) is given by 

logit{p(^ = l\6 Zi ,P, Wi)} := A< = 6 Zi + fWi. 

For each cluster c, we adopt a t location-scale distribution for 6 C , with hyperparameters \iq 
and oq with 7 degrees of freedom, as discussed by Molitor et al. (2010). Similarly, for each 
fixed effect I, we adopt the same prior for fy, but with hyperparameters fj,p and a p. 

The components of O c corresponding to the covariate model for X retain the possibility 
of being updated according to Gibbs samples. However, since conjugacy is not achieved 
with our prior choice, updating 8 C for each cluster (and /?/ for each fixed effect I) requires 
a Metropolis-within-Gibbs sample. In our implementation we propose the use of adaptive 
random-walk-Metropolis moves. 

Categorical response 

The categorical response model that we use is a simple extension of the binary response 
model of the previous section. In particular, each parameter vector C additionally contains 
an extra parameter vector 6 C = (6 c ,i, #c,2> • • • , QcR-i) of length R — 1, where R is the number 
of possible categories represented in the response data Y. Treatment of fixed effects is also 
extended, so that for each response category r = 1, 2, . . . , R — 1, there is a vector f3 r = 
(/3 r i, /3 ri 2, • • • , PrL,), where f3 r i is the coefficient for each fixed effect I (I = 1, 2, . . . , L). This 
gives f Y (Y i \® Zi ',A,W i )=p(Y i = r\0 Zi ,r,P,Wj as 

logit{p(y i = r\9 Zi ,/3, Wi)} = 9 Zi!r + f r Wi, for r = 1, 2, . . . , R - 1 
and p(Yi = 0\9 Zi ,P t Wi) = 1 - E?=?P( Y i = r\9 Zi ,P t Wi). 

In our sampler we use the same priors for each 9 c ^ r as for 6 C and p r j as for /?/ in the binary 
case, with the resulting observation about Metropolis-within-Gibbs updates remaining true. 

Binomial response 

By providing a number of trials Tj associated with each individual i (in this model an "in- 
dividual" might correspond to an area or "experiment") we can extend the binary response 
model to a Binomial response model. In particular, 

f Y (Y i \@ Zi ,A,W i )=p(Y i \e Zi ,P,W i )= Q^p?(i- Pi ) Ti - Yi , 

where 

logitfe} :=X i = e Zi +0 T W i . 

Priors used are identical to the binary case. Molitor et al. (2011) provide an example where 
this model is employed. 

Poisson response 

For count-type response data, an alternative to the Binomial model is the Poisson model. 
Under this model, each individual i is associated with an expected offset Ei, and the response 
is then modelled as 

Yi 

f Y (Yi\® Zi ,A,Wi) =p(Yi\e Zi ,p,Wi) = ^exp{- fM }, 
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where 

Hi = Ei exp{Aj}, for A* = 9 Zl + (3 T Wi. 
Prior models for 6 C and f3 are as above. 

Extra variation in the response 

For some of the above models, it is possible that we may wish to allow for extra variation 
in the response. Our sampler is designed to achieve this by alternatively modelling Aj (as 
defined in the above response models) by 

Ai = + (3 T Wi + £i, where ~ Normal(0, <rf ). 

Prior distributions for 9 C and (3 are unchanged, but in this model A contains an additional 
parameter, a^, for which prior specification is required. For simplicity we work in terms 
of the precision r £ = 1/<t|, and adopt a gamma distribution with shape parameter s Te and 
rate parameter r Te . This approach permits a simple Gibbs update of this parameter. In 
order to make inference about this model, it is also necessary to update the latent variables 
Aj at every sweep of the MCMC sampler. These parameters are considered an extension 
of A (as they are not directly associated with a specific cluster) and are therefore updated 
in Step F of the DPMM algorithm. Updates to these parameters are done using adaptive 
Random-walk-Metropolis steps. 

Gaussian response 

Our sampler is able to handle Normally distributed continuous response data. As for many 
of the discrete response models, O c is extended to contain 8 C for each c. As before A contains 
f3, but also ciy. These parameters allow us to write the response model as: 

f Y (Y i \& Zi ,A,W i )=p(Y i \e Zi ,P,al,W i ) = — ^exp{--^(y i - A,) 2 ), 
where Ai = 6 Zi + p T Wi. 

We impose the same prior settings as for the discrete response models, with the additional 
prior on ry = l/oy being Gamma(s ry , r^), where and are the shape and rate hyper 
parameters that extend 6o- Adopting this conjugate prior, updates for Ty are simple Gibbs 
updates. 

3.5. Variable selection 

In addition to fitting mixtures, potentially linking covariates and responses, it may addition- 
ally be of interest to determine which covariates actively drive the mixture components, and 
which share characteristics common to all components. This can be formulated as a question 
of variable selection. Below we present details of how this idea can be modelled, first in the 
case of discrete covariates, as considered by Papathomas et al. (2012), and then in the context 
of Gaussian covariates, a formulation which, as far as we are aware, has not been reported 
elsewhere. Variable selection for DPMM has also been considered by Chung and Dunson 
(2009). 
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Discrete covariates 

Following the approach taken by Papathomas et al. (2012) our sampler implements two types 
of variable selection. We outline these approaches briefly in this section but for full details 
the reader is referred to this paper. 

The first approach is a type of "soft" variable selection, where for each covariate j, there is an 
associated variable Q taking values in [0, 1], which informs whether the variable j is important 
in terms of supporting a mixture distribution. Letting (j>oj t k be the observed proportion of 
covariate j values taking the value k throughout the whole covariate dataset X, we define the 
new composite parameters 

4>tj,k := Cj<t>c,j,k + (1 - Cj)4>o,j,k- 

The likelihood model is then as in Equation 11 but with (f>Zi,j,k replaced by 4>z jk- Under 
this model the vector A is extended by £ = (£i, . . . , (j). We replicate the prior structure 
presented in the original paper, in particular, inducing sparsity with the introduction of 
additional binary variables pj for each j = 1, 2, . . . , J, which also extend A. For extended 
details of the conditional posteriors and updating strategy the interested reader is referred to 
Papathomas et al. (2012). Of particular note is the fact that conjugacy for <f> c _j is no longer 
retained, meaning that Metropolis-within-Gibbs updates are necessary. We use adaptive 
Random- walk-Metropolis proposals. 

An alternative approach to the variable selection presented above, is a more traditional "hard" 
(binary) cluster specific variable selection. Under this model, each mixture component c has 
an associated vector 7 C = (7 c ,i, 7c,2> • • • , 7c, j), where 7 c j is a binary random variable that 
determines whether covariate j is important to mixture component c. This allows us to write 

€,j,k ■= 7c,i0 c ,i,fc + (i - icj)<hj,k = - </>c,j,fc) {1 " 7c ' j) , 

which, as in the first variable selection model, is substituted into Equation 11 in place of <ft c ,j,ki 
to provide the likelihood for the covariate model. Under this model, each parameter vector O c 
is extended by 7 C . We use the same prior specification (and hence posterior conditionals) as 
the original paper, in particular introducing additional parameters pj for each j = 1, 2, . . . , J 
into A. We also observe that using this alternative approach, the binary nature of j C} j means 
that direct Gibbs updates for 4> c ,j,k can still be used (albeit with slightly altered Dirichlet 
distributions). 

Gaussian covariates 

The two variable selection methods described above can equally be applied to the Gaussian 
mixture case. Defining x = (x\, X2, ■ ■ ■ , xj), where Xj = ^ Xrft=i x i,j ^ s the average sample 
value of covariate j, we can define the vector p* c = (fx* l5 /x* 2 , . . . , p,* j) where either 

Mc,j = Cjf^Cfjfk "l~ (1 — Cj) x ji 

or 

<j : = TcjMcj + (1 - 7c,i)%, 

depending on which variable selection approach is being adopted. We then replace p c j with 
p*j in Equation 10. Using identical priors for Q or j c j and pj, these parameters are updated 
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as for the discrete covariate case. The posterior conditional distributions for updating \i c are 



The rich posterior output produced can be used to learn about the partition space and its 
uncertainty. It is useful to show a representative partition, as an effective way to convey the 
output of the clustering algorithm. Moreover, it is also of interest to assess the uncertainty 
associated with subgroups of this best partition. 

We discuss below the necessary steps. See also Molitor et ol. (2010). 

1. Computation of the dissimilarity matrix. At each iteration of the sample, we can record 
pairwise cluster membership and construct a score matrix, with entries equal to 1 for 
pairs belonging to the same cluster and otherwise. Averaging these matrices over the 
whole MCMC run leads to a similarity matrix S, which can be then used to identify an 
optimal partition. 

2. Identifying the optimal partition. We have implemented two deterministic clustering 
procedures to characterise the optimal partition. The first is to use find the best par- 
tition by choosing the one which minimises the least-square distance to the matrix S. 
This approach is fast, but susceptible to Monte Carlo error. The second procedure 
implemented in the package is partitioning around medoids (PAM) on the dissimilarity 
matrix 1 — S. This procedure is available in R in the package cluster and it robustly 
assigns individuals to clusters in a way consistent with matrix S. PAM is implemented 
for each possible number of clusters up to a specified maximum, and for each fixed num- 
ber of clusters the best PAM partition is selected. A final representative cluster is then 
chosen by maximising the average silhouette width across these best PAM partitions. 

3. Computation of the average risk and profile and the corresponding credible intervals. 
Given an optimal partition P* obtained as above, we examine the MCMC output to 
assess whether or not the model consistently clusters individuals in a manner similar to 
P* . Consistent clustering leads to narrower credible intervals. 

For example, for Bernoulli response, we obtain a distribution of the baseline risks for 
each cluster defined by P*. At each iteration of the sampler we compute the average of 
baseline risks p Zi for all individuals within a particular cluster k of the optimal partition. 
This average baseline risk for cluster k is computed as follows: 



where denotes the number of individuals in cluster k. In a similar way we can 
compute the distribution of cluster parameters for other response and covariates types. 



shown in Appendix A. 3, whereas those for updating other parameters are as before, with \i Cj 
replaced with fj,* ■ as appropriate. 



4. Postprocessing of the MCMC output 




5. Mixing of the MCMC algorithm 
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The likelihood of the DPMM is invariant to the order of cluster labels but the prior specifi- 
cation of the stick breaking construction is not. Therefore, to ensure adequate mixing across 
orderings, it is important to include label-switching moves. In this package we have imple- 
mented the two label switching moves proposed by Papaspiliopoulos and Roberts (2008) as 
well as a third label switching move proposed by (Hastie et al. 2013). This latter move updates 
the new cluster weights so they are something like their expected value conditional upon the 
new allocations. See (Hastie et al. 2013) for more details on the performance of this sampler. 

Even with these label switching moves, convergence may be problematic and the user must 
address this issue using diagnostic tools. One difficulty in this respect is that there are no 
parameters in the model that can reliably demonstrate convergence. The parameters of the 
fixed effects tend to converge very quickly, regardless of the underlying clustering, as they are 
not cluster specific and therefore are not a good indication of the overall convergence. The 
cluster parameters, such as the c 's, cannot be tracked as their number (and their labels) 
can change from one iteration to the next. The concentration parameter a is not a reliable 
indicator of convergence either, as discussed in Hastie et al. (2013). 

To overcome this challenge, we have implemented the computation of the marginal model 
posterior p(Z|D) as an additional diagnostic tool. This represents the posterior distribution 
of the allocations given the data, having marginalised out all the other parameters (Hastie 
et al. 2013). The marginal model posterior is computed for each run of the MCMC and it 
has proved very effective for our real examples to compare runs with different initialisations 
and identify runs that were significantly different from others. Our experience suggests that 
it is harder for the MCMC algorithm to split rather than merge clusters. This means that 
it is important to initialise the algorithm with a number of cluster which is greater than the 
number of clusters that the algorithm will convergence to. The marginal model posterior can 
help to assess what such number is for each specific example. 

Finally, while optimal partitions allow visualisation of the result of a clustering algorithm, 
such an approach must be applied with care as we are not aware of any effective method to 
quantify nor visualise clustering uncertainty. For this reason, we advise using predictions as 
an additional tool to assess convergence and visualise the output of the algorithm, as their 
posterior distributions can be compared across runs using standard methods. More details 
of using the package for predictions can be found in Section 7. We have observed that these 
posterior predictive distributions tend to be more stable than optimal partitions. 

6. Software 

Our implementation of the DPMM algorithm is available as an R package from CRAN. The 
software is primarily written in C++ and R. 

The sampler implements the algorithm exactly as detailed in the current paper, although 
continued work is in progress to extend the scope of the software to cover additional models. 

The program is further customisable through the specification of a hyperparameters, providing 
name-value pairs for the various hyperparameters used within the model being run. If the 
value is not set for a specific hyperparameter, it takes its default value. Default values can 
be found within the full documentation that is available as part of the software. 

Moreover, this package can produce predicted values based on random allocations, or a Rao- 
Blackwellised estimate of predictions, where the probabilities of allocations are used instead 
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of actually performing a random allocation. 



7. Examples 



7.1. Simulated example 

We simulated 1,000 subjects, partitioned into 5 groups in a balanced manner. Ten binary 
covariates were considered. To demonstrate one of the variable selection approaches within 
the sampler, only the first eight covariates support a clustering structure. The response is 
binary. This dataset can be simulated as follows. 

fi> library (PReMiuM) 

R> inputs <- generateSampleDataFile(clusSummaryVarSelectBernoulliDiscrete() ) 

We use the default values for all hyperparameters: Dirichlet conjugate priors with a,- = 1 for 
the covariates and 



We initialised all chains allocating subjects randomly to 20 groups. We run the chain for 
10,000 after a burn-in sample of 20,000 iterations. While ensuring convergence is a complex 
problem, we have observed good stability in all our runs, with results from independent chains 
virtually identical. 

R> runlnfoObj <- profRegr(yModel=inputs$yModel, xModel=inputs$xModel , 
nSweeps=10000 , nBurn=20000, data=inputs$inputData, 
output= "output ", covNames=inputs$covNames , nClusInit=20, 
fixedEffectsNames=inputs$fixedEffectNames , run=TRUE) 

R> dissimObj <- calcDissimilarityMatrix (runlnfoObj) 

R> clusObj <- calcOptimalClustering(dissimObj) 

R> riskProfileObj <- calcAvgRiskAndProfile(clusObj) 

R> cluster Order Ob j <- plotRiskProf He (riskProfileObj , ' summary- sim.png' ) 

Figure 1 shows a box-plot of the posterior distribution for the probabilities of the response 
and the covariates for the 5 clusters that form the representative clustering. 

7.2. Predictions 

PReMiuM can produce predicted values based on simple allocations (the default), or a Rao- 
Blackwellised estimate of predictions, where the probabilities of allocations are used instead 
of actually performing a random allocation. The following code can be used to reproduce the 
predictive distribution plotted in 2. The predictions are consistent with the simulated data. 



p(a) 
P(0 C ) 



Gamma(l, 0.5) 

t 7 (0,2.5) 

t 7 (0,2.5). 
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Figure 1: Posterior distribution for the probabilities of the binary response for the represen- 
tative clustering. 
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i 1 1 1 1 1 i 1 1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 



[0,0,1,0,0] [0,0,1, NA,0] 

Figure 2: The predictive distribution for two scenarios. The covariate values for each predic- 
tive scenarios are given below each graph. 'NA' represents a missing value. 

fi> # generate data 

fi> inputs <- generateSampleDataFile(clusSummaryBernoulliDiscrete() ) 
R> # set predictive scenarios 

R> preds <- data . frame (matrix (c (0, 0, 1, 0, 0, 0, 0, 1, NA, 0) , 

ncol=5, byrow=TRUE) ) 
R> colnames (preds) <- names ( input s$inputD at a) [2: 

(length (inputs$covNames)+l)] 
R> # run profile regression 

R> runlnfoObj <- prof Re gr (yModel= input s$yModel , xModel= input s$xModel , 
nSweeps=10000 , nBurn=10000 , data=inputs$inputData, output=" output " , 
covNames=inputs$covNames , predi ct=preds) 

R> # postprocessing 

R> dissimObj <- calcDissimilarityMatrix (runlnfoObj) 

R> clusObj <- calcOptimalClustering(dissimObj) 

R> riskProfileObj <- calcAvgRiskAndProfile(clusObj) 

R> predsOutput <- calcPredictions (riskProfileObj , fullSweepPredictions=T) 
R> # prediction plots 
R> par(mfrow=c(l,2)) 
R> for (i in 1:2){ 

R> hist (predsOutput$predictedYPerSweep [ , i,l] ) 
R> } 



7.3. Variable selection 

Note that covariates 9 and 10 in Figure 1 have similar profile probabilities for all clusters, as 
they have been simulated not to affect the clustering. The variable selection approach will 
tease out the covariates that contain clustering support and exclude the others from affecting 
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Figure 3: The sampled values of p p for each covariate. 

the clustering. 

We initialised the chains as in the simulated example above, with additional prior specifica- 
tions given by 

p(pp) = Beta(0.5,0.5). (13) 

The algorithm consistently sampled values p p in accordance with the simulated data, as shown 
in Fig. 3. This figure can be reproduced as follows. 

R> inputs <- generateSampleDataFile(clusSummaryVarSelectBernoulliDiscrete() ) 
R> runInfoObj<-profRegr(yModel=inputs$yModel, xModel=inputs$xModel , 

nSweeps=10000 , nBurn=10000 , data=inputs$inputData, 

output= "output " , covNames=inputs$covNames , 

varSelectType="BinaryCluster") 

R> rho <- summariseVarSelectRho(runlnfoObj) 
R> par(mfrow=c(5,2)) 

R> for (k in 1 :runInfoObj$nCovariates){ 

R> hist(rho$rho[,k] , xlim=c(0 , 1) , main="") 

R> } 



7.4. Marginal Model Posterior 

Hastie et ol. (2013) introduce the marginal model posterior as a tool to assess convergence for 
Dirichlet process mixtures. The marginal model posterior can be computed in PReMiuM. The 
code below computes the marginal model posterior for four different runs of profile regression 
on the same dataset with different initialisations - different number of initial clusters. As it 
can be seen in Figure 4, plotted using the code below, for the given simulated dataset, all 
the MCMC runs appear to converge to subsets of the model space with equivalent marginal 
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model posterior. This does not imply convergence, but it is a useful diagnostic tool as it can 
highlight a lack of convergence in certain circumstances (Hastie et al. 2013). 

fi> # generate data 

R> inputs <- generateSampleDataFile(clusSummaryBernoulliDiscrete() ) 

R> # select 4 different starting points 
R> nClusInit<-c(10,20,50,75) 

R> # run profile regression for all four starting points 

R> # and compute marginal model posterior 

R> for (i in 1 : length (nClusInit) ) { 

R> runInfoObj<-profRegr(yModel=inputs$yModel, 

xModel=inputs$xModel , nSweeps=10000 , 

nBurn=10000, data=inputs$inputData, 

output=paste (" output / init " ,nClusInit [i] , sep=" ") , 

covNames = input s$covNames , alpha=l, 

fixedEffectsNames = inputs$fixedEffectNames,nClusInit=nClusInit [i] ) 
R> margModelPosterior(runlnfoObj) 
R> } 

R> # retrieve marginal model posterior output 
R> mmp<-list() 

R> for (i in 1 : length (nClusInit) ) { 

R> mmp [ [i] ] <-read . table (paste ("output/ init " , nClusInit [i] , 

"_margModPost . txt " , sep= ""))[,!] 

R> } 

R> # draw plot 

R> png (" plots .png") 

R> plot (c (head (nClusInit ,n=l) -0 . 5 , tail (nClusInit ,n=l) +0.5) , 

c(min (unlist (mmp)) ,max(unlist (mmp))) , type="n" , 

ylab="Log marginal model posterior" , 

xlab="Initial number of clusters" , 

cex . lab=l . 3,xaxt= "n ") 
R> axis (1 , at=nClusInit , labels=nClusInit) 
R> for (i in 1 : length (nClusInit) ) { 
R> boxplot (mmp [ [i] ] ,add=T,at=nClusInit [i] ,pch=" . " , 

boxwex=5 , col="lightblue") 

R> } 

R> dev.offO 
7.5. Run times 

We have run simulations to test PReMiuM's speed and how it scales when the number of 
subjects or covariates increases. The code was run in serial on an Intel(R) Xeon(R) CPU 
E5-2650 clocked at 2.00GHz with 20MB L3 cache, on a system with 64GB RAM. 
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10 20 50 

Initial number of clusters 



Figure 4: The marginal model posterior for four runs of profile regression, on the same dataset 
but with different initialisations. 



Table 1: Time to run 100 iterations of PReMiuM for Bernoulli response and Discrete covari- 
ates 





Number of covariates 


Number of Subjects 


100 1,000 10,000 


1,000 


4 sec 1 min 13 min 


2,500 


11 sec 1.4 min 26 min 


5,000 


32 sec 3.5 min 32 min 



Conclusions 

The structure of PReMiuM objects gives rise to a wider variety of uses than can be described 
in detail here. Our intention was to provide a tutorial for Dirichlet process clustering and to 
illustrate the basic features of the sampler and post-processing tools that we have implemented 
in PReMiuM to demonstrate its utility. Our long-term goal is to continue to develop this 
package for analysis on complex and high dimensional datasets as well to increase the flexibility 
with regards to the data types that can be analysed. 
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Table 2: Time to run 500 iterations of PReMiuM for Normal response and Normal covariates 





Number of covariates 


Number of Subjects 


50 


100 


250 


34 sec 


4 min 


500 


51 sec 


5 min 


1,000 


1.3 min 


7 min 



A. Appendices 



A.l. 

We provide the following proposition to support our assertions regarding C*. 

Proposition 1. Suppose that we have a model with posterior as given in Equation 7. Suppose 
Z* ,U* and C* are defined as in Section 2.2. Then: 

(i) "0c < Ui for all i = 1,2, ... ,n and all c > C* almost surely; 

(ii) C* > Z* almost surely; and 
(Hi) C* < oo almost surely. 

Proof. We rely on the fact that if Vi,V c ~ Beta(l,a) , tf>i = V\ and vb c = V C \\ 1<C {1 — V c ) 
for c = 2, 3, . . . then ^c^=i V'c = 1- A proof of this result for the DPMM (in terms of more 
general conditions) is provided by Ishwaran and James (2001). Then: 

(i) By definition, for alH = 1, 2, . . . , n 

C* oo 

Ui>u* > i-^Vc= Yl V> c >Vv Vc' > C*. 

c=l c=C*+l 

(ii) Let i* be an individual i such that Z$ = Z*. Again, by definition, U* < Ui* < ipz*- 
This implies 1 — ipz* < 1 — J7*, meaning 

z*-i 

^2il> e <l-i/>z*<l-U*. 

c=l 

By definition of C*, this implies C* > Z* almost surely. 

(Hi) Since 5Zc=i V'c = 1, this is a convergent series. By definition of a convergent series and 
because U* > we have C* < oo almost surely. 

□ 

A.2. 

Below are additional comments to explain the blocking strategy employed in the DPMM 
algorithm. We use '-'to denote "all other parameters and data". 
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Step A. This step is a straightforward calculation of Z*, which (potentially) changes at each 
iteration (with the update of Z). The set A is defined immediately conditional on this 
value. 

Step B. This is a joint update of U and the parameters corresponding to the active com- 
ponents in A, with the inclusion of label switching moves. The principle is to use the 
identity p(V A , @ A , Z, U\-) = p(V A , @ A , Z\-)p{U\V A , ® A , Z, ■). We proceed by first 
updating (V A ,@ A ) ~ p{V A ,& A \Z,-) = f p(V A ,@ A ,U\Z,-)dU. Due to the condi- 
tional independence of V and 0, this can be done in two steps: updating V (B.l) then 
updating (B.2). The moves are presented as Gibbs updates, but in fact they are 
Metropolis-Hastings moves, where the variable of interest (for example V A ) is sampled 
from its full conditional, and the other variables (for example 0^ and Z) are kept fixed. 
This results in an acceptance probability of 1, making the Gibbs update equivalent. The 
updated values are then used as interim values for V and in (Metropolis-Hastings) 
label-switching moves (see Section 5) which are applied in B.3. The moves can change 
the values of V as well as their order. The resulting sample of V and are the final 
updated values of these parameters for this sweep. The updated allocation vector Z 
is used as an interim value throughout the remainder of the steps of the sweep, before 
the final updated value of Z is sampled in Step G. The final part (B.4) of Step B is to 
update U conditional upon the updated value of V and and the interim value of Z. 

B.l Integrating out U and taking advantage of the conjugacy of the distribution for V 
inherent in the DPMM, along with the conditional independence structure, each 
component of vector V A is updated by sampling V c ~ Beta(l + n c , a + c G A, 
where n c = 1 {z,=c} and ra+ = £\ l{ Zl>c }- 

B.2 Integrating out U and taking advantage of the conditional independence structure, 
0^ is updated from p(@ A \Z, Qq, D). The full details of this will depend upon the 
application and the choice of / and P@ • Examples are given in Sections 3. 

B.3 This step implements the Metropolis-Hastings label-switching moves detailed in 
Section 5. These moves update V A , ® A and Z jointly from their conditional 
distribution with U integrated out. These moves are conditional upon the values 
of V A and 0^ sampled in steps B.l and B.2. 

B.4 Conditioning on the updated values of V A ,® A and Z from step B.3, this step 
samples each Ui, i = l,...,n, independently according to the full conditional 
distribution, U ~ Unif [0, = Unif[0, Vz i IIz<z-(l — ^)]> as detailed in Walker 
(2007). 

Step C. To compute U* is straightforward given the updated value of U from step B.JL. The 
value of Z* (and with it the set A) can only change from that computed in Step A if the 
mixture component corresponding to the old Z* was involved in a label switching move, 
and then only if the component it was switched with was empty. By design of the label 
switching moves (see Section 5) this means that Z* and A can only get smaller, with 
the consequence that parameters corresponding to a small number of components may 
be updated twice per MCMC sweep (once in Step B as part of the active components 
A, and once in Steps D and E as part of the updated potential components P). This 
has no ill-effects as long as the most recently updated parameter values are used at each 
subsequent step. 
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Step D. This is a joint update of a, V p and V 1 . The principle is to use the identity 
p(a, V p , V 1 ]-) = p(a\-)p(V p , V ! \a, •). We proceed by first updating a ~ = 
/ p(a,V p ,V I \-)dV p dV I (step D.J) and then sampling p(F p , V J |a, •) (step D.2). To 
update we need to alternate Gibbs samples with checks to evaluate whether the 
component just updated is C*. In this way the set P is determined on the fly. As 
mentioned in Section 2.1, no actual sampling is done for the inactive components in set 
I as these would just be samples from the prior and have no impact on the likelihood 
or any other conditionals in the MCMC sweep. 

If the prior for a is a Gamma distribution then it is alternatively possible to sample 
directly from the conditional with V A also marginalised (see Walker, 2007 and Escobar 
and West, 1995 for details). We retain our version as any prior for a can be then used. 

D.l Since V p and V 1 both correspond to empty mixture components, the only con- 
tribution to the joint posterior conditional is through the prior. This allows us to 
easily integrate out V p and V 1 . Due to the conditional independence, the result- 
ing posterior from which this step samples is p(a\V , Z). Typically this cannot 
be sampled directly, so we employ a Metropolis-within-Gibbs move to update a, 
using an adaptive random-walk-Metropolis proposal on the log-scale. 

D. 2 We begin by setting C = Z*. We then repeat the following two steps until the stop 

condition is reached. First, check if Ylc=i ^ c > 1 — Next, if the condition is met 
we set C* = C and stop, otherwise we set C = C + 1 and sample Vc ~ Beta(l, a). 

Step E. This step updates the parameters ® p and & 1 from the distribution p(® p 

The set P is fully determined from step D.2. The parameters correspond to empty 
mixture components, so updated values of & p are sampled directly from the prior. As 
with other inactive parameter 0^ play no part in this MCMC sweep and so are not 
updated. 

E. 1 Taking advantage of the conditional independence structure, in this step we update 

P by doing a Gibbs sample from the prior, such that C ~ p(0 c |©o) for each 
c G P. As C may be a vector of parameters, this may involve a number of Gibbs 
updates per component c. The full details depend upon the choice of Pe . See 
Section 3 for examples. 

Step F. Here, the global (non-cluster-specific) likelihood parameters A associated with / are 
updated. There are only a finite number of such parameters so no special updates are 
needed. 

F. l Sample A ~ p(A\® A ,Z,D). Due to the conditional independence structure of 

the model the update only depends on the current value of the active likelihood 
parameters ® A , the allocations Z, and the data D. A may contain multiple 
parameters, so this stage may contain many Gibbs and / or Metropolis-within- 
Gibbs steps. Full details will depend upon the choice of / and p(A), see Section 3 
for examples. 

Step G. The final step of the algorithm is to update the parameter allocations Z, conditional 
on the newly updated values of the other parameters. 
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G.l We sample Z ~ p(Z\V A , V p , ® A , ® p , U, A, D). Because of the independence of 
the individuals i, this is a series of Gibbs updates for each i = 1, 2, . . . , n, sampling 
Zi ~ p(Z{ = c\ V A , V p , ® A , ® p , Ui, A, Di) where D{ is the data for individual 
i. For each update, since the conditional p{Zi = c|-) has no posterior mass for 
clusters c where Ui > ip c , this update depends only on the parameters associated 
with the finite number of clusters in the sets A and P, making the update a simple 
multinomial sample according to a finite vector of weights. The full details of the 
weights will depend upon the choice of / and Pe - 



A.3. 

In the case of variable selection for continuous covariates, define the J x J matrix T c as 

r [ic,j i = j; j = 1,2,..., J 
1 otherwise, 

for variable selection method 1, and 

r fO i = j; 3 = 1,2,..., J 

1 otherwise. 

for method 2 as presented in Section 3.5.2. Let Ij denote the J x J identity matrix, n c 
be the number of individuals allocated to cluster c, X be as defined in Section 3.5.2 and 
X c = (X Cj i,X C) 2, ■ ■ ■ ,X C; j) such that X c j = X^-z = c ^M'/ n c- The posterior conditional dis- 
tributions for updating \x c for c 6 A are then given by 

p ~ Normal(/i, S) 

where 

s = (So 1 + n c rE- 1 r)" 1 , 

and 

fi = £ [s„ Vo + n c rE- x (x c - (ij - r)x)] . 
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